Least-Squares Temporal Difference Learning

Author

  • Justin A. Boyan
Abstract

TD(λ) is a popular family of algorithms for approximate policy evaluation in large MDPs. TD(λ) works by incrementally updating the value function after each observed transition. It has two major drawbacks: it makes inefficient use of data, and it requires the user to manually tune a stepsize schedule for good performance. For the case of linear value function approximations and λ = 0, the Least-Squares TD (LSTD) algorithm of Bradtke and Barto (1996) eliminates all stepsize parameters and improves data efficiency. This paper extends Bradtke and Barto's work in three significant ways. First, it presents a simpler derivation of the LSTD algorithm. Second, it generalizes from λ = 0 to arbitrary values of λ; at the extreme of λ = 1, the resulting algorithm is shown to be a practical formulation of supervised linear regression. Third, it presents a novel, intuitive interpretation of LSTD as a model-based reinforcement learning technique.
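The batch solve the abstract describes can be sketched in a few lines of NumPy. This is a minimal illustrative implementation of LSTD(λ) for linear value-function approximation, not the paper's exact formulation; the small ridge term `eps` and the single-state example below are assumptions added to keep the sketch self-contained.

```python
import numpy as np

def lstd(transitions, gamma=0.9, lam=0.0, eps=1e-6):
    """LSTD(lambda) for linear value functions V(s) = phi(s) . theta.

    transitions: list of (phi, r, phi_next) triples; phi_next is None
    at an episode boundary. Solves A theta = b, where
        A = sum_t z_t (phi_t - gamma * phi_{t+1})^T
        b = sum_t z_t r_t
    and z_t = gamma * lam * z_{t-1} + phi_t is the eligibility trace.
    """
    k = len(transitions[0][0])
    A = eps * np.eye(k)           # small ridge term keeps A invertible
    b = np.zeros(k)
    z = np.zeros(k)               # eligibility trace
    for phi, r, phi_next in transitions:
        z = gamma * lam * z + phi
        nxt = np.zeros(k) if phi_next is None else phi_next
        A += np.outer(z, phi - gamma * nxt)
        b += z * r
        if phi_next is None:      # episode ended: reset the trace
            z = np.zeros(k)
    return np.linalg.solve(A, b)

# Illustrative check: a single state with a self-loop, reward 1,
# gamma = 0.9, feature phi(s) = [1].  The true value is 1/(1-0.9) = 10.
transitions = [(np.array([1.0]), 1.0, np.array([1.0]))] * 50
theta = lstd(transitions, gamma=0.9, lam=0.0)
```

Note there are no stepsize parameters anywhere: all observed transitions are accumulated into `A` and `b`, and the value-function weights come from one linear solve, which is the data-efficiency point the abstract makes.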


Similar resources

An Analysis of Temporal Difference Learning with Function Approximation

We discuss the temporal difference learning algorithm as applied to approximating the cost-to-go function of an infinite-horizon discounted Markov chain. The algorithm we analyze updates parameters of a linear function approximator on-line during a single endless trajectory of an irreducible aperiodic Markov chain with a finite or infinite state space. We present a proof of convergence with probabili...


An Analysis of Temporal-Difference Learning with Function Approximation

We discuss the temporal-difference learning algorithm, as applied to approximating the cost-to-go function of an infinite-horizon discounted Markov chain. The algorithm we analyze updates parameters of a linear function approximator on-line, during a single endless trajectory of an irreducible aperiodic Markov chain with a finite or infinite state space. We present a proof of convergence (with proba...


Analytical Mean Squared Error Curves in Temporal Difference Learning

We have calculated analytical expressions for how the bias and variance of the estimators provided by various temporal difference value estimation algorithms change with offline updates over trials in absorbing Markov chains using lookup table representations. We illustrate classes of learning curve behavior in various chains, and show the manner in which TD is sensitive to the choice of its steps...


The Mean and the Variance Matrix of the 'Fixed' GPS Baseline

In this contribution we determine the first two moments of the 'fixed' GPS baseline. The first two moments of the 'float' solution are well-known. They follow from standard adjustment theory. In order to determine the corresponding moments of the 'fixed' solution, the probabilistic characteristics of the integer least-squares ambiguities need to be taken into account. It is shown that the 'fixed' GPS b...


Pii: S0165-1684(01)00098-6

A popular technique for time delay estimation is to use an FIR filter to model the time difference, and the filter weights are interpolated with a sinc function to obtain the delay estimate. However, the sinc interpolator requires a sufficiently long filter length for accurate delay estimation. In this paper, we propose to process the filter weights via a least-squares-based method in order to acquire ...



Journal title:

Volume   Issue

Pages  -

Publication year: 1999